So far, we’ve been using just the simple mean to make predictions. Going forward, we’ll continue using the simple mean to make predictions, but now in a complicated way. Before, when we calculated conditional means, we did so in certain “groupings” of variables. When we run linear regression, we no longer need to do so. Instead, linear regression allows us to calculate the conditional mean of the outcome at every value of the predictor. If the predictor takes on just a few values, then that’s the number of conditional means that will be calculated. If the predictor is continuous and takes on a large number of values, we’ll still be able to calculate the conditional mean at every one of those values.
As an example, consider the following plot of weekly spending on entertainment (movies, video games, streaming, etc) as a function of hourly wages.
If we’re asked to summarize the relationship between wages and spending, we could use conditional means. That’s what the blue line above does: for each of the four quartiles of hourly wages, it provides the average spending on entertainment. We could expand this logic and include both more levels of hourly wages and other variables with which to calculate the mean– for example, hourly wages and level of education. The problem is that this approach gives us lots of numbers back.
The graphic below fits a line (in red) to the data:
The line has the general function y=12+(2*x). We’re saying that if hourly wages were 0, the person would be predicted to spend 12 on entertainment (who knows where they got it). As hourly wages go up, the line predicts that weekly spending goes up $2 for every $1 increase in hourly wages. The line we fit to this data summarizes the relationship using just two numbers: the intercept (value of the y when x=0) and slope– the amount that y increases as a function of x. We call the slope the coefficient on x.
The general model we posit for regression is as follows:
\[Y=\alpha+\beta_1 x_1 +\beta_2 x_2+ ... \beta_k x_k + \epsilon\]
It’s just a linear, additive model. Y increases or decreases as a function of x, with multiple x’s included. \(\epsilon\) is the extent to which an individual value is above or below the line created. \(\beta\) is the amount that \(Y\) increases or decreases for a one unit change in \(x\). \(\alpha\) (sometimes referred to as \(\beta_0\)) is the intercept that tells the value of \(Y\) when all the \(x_1,\dots,x_k\) are equal to zero. With the simplifying assumption of linearity, we can summarize what we know about the relationship between the independent and dependent variable in a very parsimonious way. The trade-off is that an assumption of linearity may not always be warranted.
Note that this general model is the empiricist’s theory. We have to make choices about what \(x_1,\dots,x_k\) will be. For example, should we include wages and family structure into a model of spending on entertainment? Theoretically, we might think that, as your family gets bigger, you must spend more money on entertainment!
When we actually apply this model to data, we obtain predictions for both \(Y\) and each of the \(\beta\) parameters. We denote predicted values with a “hat”: \(\hat{Y}, \hat{\beta_1}\), etc. So how do we actually DO this?
We are going to go back to two previous datasets to get started. Our main focus will be on the college debt data from the beginning of the semester. Recall our previous work where we looked at the relationship between SAT scores and future earnings. All else equal, we expect that future earnings (\(Y\)) are positively related with SAT scores (\(X\)).
Let’s load the data and packages first.
require(tidyverse)
require(plotly)
sc_debt <- read_rds('https://github.com/rweldzius/PSC4175_SUM2025/raw/main/Data/sc_debt.Rds')
Now let’s look at our two variables of interest before we do anything else. Are these continuous? Categorical?
sc_debt %>%
select(sat_avg,md_earn_wne_p6)
## # A tibble: 2,546 × 2
## sat_avg md_earn_wne_p6
## <int> <int>
## 1 939 25200
## 2 1234 35100
## 3 NA 30700
## 4 1319 36200
## 5 946 22600
## 6 1261 37400
## 7 NA 23100
## 8 NA 33400
## 9 1082 30100
## 10 1300 39500
## # ℹ 2,536 more rows
As we can see, they both appear to be continuous measures, although
we can see that many schools don’t report the average SAT scores of
their students. Let’s use the summary() function to count
how much missing data we are dealing with.
summary(sc_debt %>% select(sat_avg,md_earn_wne_p6))
## sat_avg md_earn_wne_p6
## Min. : 737 Min. : 10600
## 1st Qu.:1053 1st Qu.: 26100
## Median :1119 Median : 31500
## Mean :1141 Mean : 33028
## 3rd Qu.:1205 3rd Qu.: 37400
## Max. :1557 Max. :120400
## NA's :1317 NA's :240
240 schools don’t report the future earnings of their recent graduates, and a whopping 1,317 schools don’t report SAT scores!
Let’s now look at both variables with univariate plots.
sc_debt %>%
ggplot(aes(x = sat_avg)) +
geom_histogram()
sc_debt %>%
ggplot(aes(x = md_earn_wne_p6)) +
geom_histogram()
Both variables are mildly skewed, but not enough for us to worry about transforming them.
Now let’s plot them as a scatter plot.
sc_debt %>%
ggplot(aes(x = sat_avg,y = md_earn_wne_p6)) +
geom_point()
As we can see just by looking at the data, there is a positive
relationship, where schools with higher SAT averages produce recent
graduates with higher median earnings. However, note that there are a
few outliers. Let’s use plotly to help is identify these
outliers.
p <- sc_debt %>%
ggplot(aes(x = sat_avg,y = md_earn_wne_p6,text = instnm)) +
geom_point()
ggplotly(p)
The University of Health Sciences and Pharmacy in St. Louis produces really wealthy recent graduates, despite having a middle-of-the-pack SAT score! (Why might this be?)
Now let’s overlay the line of best fit using
geom_smooth(method = 'lm').
sc_debt %>%
ggplot(aes(x = sat_avg,y = md_earn_wne_p6)) +
geom_point() +
geom_smooth(method = 'lm')
This line tells us that, on average, schools with average SAT scores of 1,000 produce recent graduates with incomes of roughly $30,000 annually.
Similarly, schools with average SAT scores of 1,400 produce graduates with median incomes of roughly $45,000.
Under the hood, ggplot is using a specific function to
draw this line called the lm() function, which stands for
“linear model”. The function is choosing values of \(\alpha\) and \(\beta\) that minimize the errors for every
school in the dataset. We can access this function directly
ourselves!
The lm() function takes two inputs that we must define.
The first is the formula, and the second is the
data.
The formula is literally just us writing the regression
equation from above, but in terms R can understand. It
takes the format of Y ~ X, which means \(Y = \alpha + \beta X\). (We don’t need to
tell R about \(\alpha\)
and \(\beta\) in the
formula…it calculates these values for us.)
In our setting, we want to set \(Y\)
to md_earn_wne_p6 and \(X\) to sat_avg, just like we
drew them on the axes in our plot. Thus our formula becomes
md_earn_wne_p6 ~ sat_avg. We also need to tell
R which data we are using, in this case our
sc_debt object. We then save the entire regression model to
an object, that I’ll call model_earn_sat
model_earn_sat <- lm(formula = md_earn_wne_p6 ~ sat_avg,data = sc_debt)
We can then look at the model results using the
summary() function.
summary(model_earn_sat)
##
## Call:
## lm(formula = md_earn_wne_p6 ~ sat_avg, data = sc_debt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23239 -4311 -852 2893 78695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12053.87 1939.80 -6.214 7.12e-10 ***
## sat_avg 42.60 1.69 25.203 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7594 on 1196 degrees of freedom
## (1348 observations deleted due to missingness)
## Multiple R-squared: 0.3469, Adjusted R-squared: 0.3463
## F-statistic: 635.2 on 1 and 1196 DF, p-value: < 2.2e-16
The model gives us a lot of information. The results tell us that the
(Intercept) (which is the same as \(\alpha\) in our theory equation, and \(\hat{\alpha}\) in our results) is equal to
-12054.87, and that the sat_avg (which is the same as \(\beta\) in our theory equation, and \(\hat{\beta}\) in our results) is equal to
42.60.
Substantively, we interpret these as follows:
Obviously, the \(\alpha\) value (the
(Intercept)) is somewhat meaningless. There are no schools
whose average SAT scores are zero…the minimum possible SAT score can’t
even be zero!
But the \(\beta\) value is very interesting and directly speaks to the research question, theory, and hypothesis! According to this model, each additional point on the SATs yields an increase in earnings of $42!
There are many other numbers in this table, including the Standard
Error (Std. Error), the t-statistic (t value),
and the p-value (Pr(>|t|)). It turns out that the
t-statistic is just the coefficient (Estimate) divided by
the standard error (Std. Error), and the p-value is a way
of converting the t-statistic into a measure of uncertainty! You don’t
need to know these steps in detail for this course. However, you do need
to remember that the p-value is just 1 minus the confidence level. The
smaller the p-value, the more confident we are that the
\(\beta\) coefficient is not zero
(i.e., that there is a relationship).
In this case, the p-value is basically zero, meaning we are basically
100% confident that the relationship between md_earn_wne_p6
and sat_avg is positive!
Let’s ask a different research question: what is the relationship between future earnings and the admissions rate?
Theory: Can you come up with a theory? I would assume that more selective schools have more rigorous training, which then leads to wealthier recent graduates.
Hypothesis: What is your hypothesis? Mine is that the relationship between the admissions rate and future earnings is negative.
Let’s test it!
First, we want to again look at both univariate and multivariate visualizations of our \(Y\) and \(X\) variables.
sc_debt %>%
ggplot(aes(x = adm_rate)) +
geom_histogram()
sc_debt %>%
ggplot(aes(x = md_earn_wne_p6)) +
geom_histogram()
As above, there is some mild skew, but nothing to be too worried about.
Let’s also check on missingness, just so we know how many schools we don’t have in our data.
summary(sc_debt %>% select(md_earn_wne_p6,adm_rate))
## md_earn_wne_p6 adm_rate
## Min. : 10600 Min. :0.0000
## 1st Qu.: 26100 1st Qu.:0.5668
## Median : 31500 Median :0.7115
## Mean : 33028 Mean :0.6791
## 3rd Qu.: 37400 3rd Qu.:0.8333
## Max. :120400 Max. :1.0000
## NA's :240 NA's :958
As above, 240 schools don’t report recent graduate earnings, and almost 1,000 don’t report on their admissions rate! This is an important aspect of the data to recognize, since it limits how generalizable our conclusions might be.
Now let’s plot the multivariate relationship.
sc_debt %>%
ggplot(aes(x = adm_rate,y = md_earn_wne_p6))+
geom_point() +
geom_smooth(method = 'lm')
We find some evidence of a downward sloping relationship, although it is less steep than what we saw with SAT scores.
Quick Exercise: Now let’s run the regression again. What do you conclude? How confident are you in this conclusion?
# INSERT CODE HERE
Write your answer here.